perf(vllm): optimize MiniMax M3 inference on MI300X by Oseltamivir · Pull Request #1782 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-15T19:13:24Z

Summary

keep this PR stacked on the current head (6f5a3991) of #1753 ("[Experimental][DNM till upstream PR merges][AMD] perf: load-time block FP8 MoE for MiniMax M3 on MI300X"), which supplies the load-time 128x128 block-FP8 conversion
replace the old route-compaction-only patch with the cumulative MI300X runtime patch validated during profiling
optimize MiniMax M3 sparse attention, index scoring, FP8 MoE routing/reduction, router projection, and residual collectives
enable the pinned AITER Gemma all-reduce + RMSNorm path only for TP8; EP8 remains on the faster native collective
use a 32K scheduler token budget only for measured long-prompt points (ISL >= 8192 && CONC >= 16)
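The scheduler-budget condition in the last item can be sketched as a small predicate. This is a Python mirror for illustration only: the benchmark script itself is a shell script, and the function name `scheduler_budget_args` is hypothetical. The thresholds and the `--max-num-batched-tokens 32768` flag are the ones stated in this PR.

```python
def scheduler_budget_args(isl: int, conc: int) -> list[str]:
    """Return the extra server flags for the long-prompt scheduler budget.

    Hypothetical helper: mirrors the `ISL >= 8192 && CONC >= 16` condition
    described in the summary, nothing more.
    """
    if isl >= 8192 and conc >= 16:
        return ["--max-num-batched-tokens", "32768"]
    # Short prompts and low concurrency keep the server default.
    return []


if __name__ == "__main__":
    for isl, conc in [(1024, 16), (8192, 1), (8192, 16), (32768, 16)]:
        print(isl, conc, scheduler_budget_args(isl, conc))
```

Both conditions must hold, so a 1k prompt at c16 and an 8k prompt at c1 both fall back to the default budget.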

This PR contains no profiling configuration and does not modify perf-changelog.yaml.

Profile basis

The final all-rank 8k1k/c256 EP profile shows one kernel stream and no compute/communication overlap window:

Profile	Decode step	GPU busy	Collective + norm
native EP collective	106.324 ms	99.2%	63.723 ms
AITER fused EP collective	109.257 ms	99.1%	66.088 ms

Native EP is 2.7% faster per profiled decode step. Across the profiled 256-request batch it improves output throughput by 6.4%, mean TTFT by 6.8%, and mean TPOT by 2.5%.

The remaining native critical path is:

native all-reduce: 62.913 ms
block-FP8 MoE experts: 17.361 ms
sparse index score: 7.920 ms
sparse attention decode: 5.401 ms

The dependencies are serial at the block boundaries, so moving these kernels to another stream would not hide useful work. The implementation instead removes work from each stage and fuses only where the measured dependency permits it.
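As a quick check on the profile numbers above, the following sketch recomputes the per-step difference and the share of the native decode step covered by the four listed kernels. It only does arithmetic on the figures quoted in this description; the variable names are illustrative.

```python
# Decode-step times from the profile table (milliseconds).
native_step_ms = 106.324
aiter_step_ms = 109.257

# Remaining native critical path, as listed above (milliseconds).
critical_path_ms = {
    "native all-reduce": 62.913,
    "block-FP8 MoE experts": 17.361,
    "sparse index score": 7.920,
    "sparse attention decode": 5.401,
}


def pct(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Step-time reduction of native EP relative to the AITER fused collective.
step_saving_pct = pct(aiter_step_ms - native_step_ms, aiter_step_ms)

# How much of the native step the four listed kernels account for.
listed_ms = round(sum(critical_path_ms.values()), 3)
listed_share_pct = pct(listed_ms, native_step_ms)
allreduce_share_pct = pct(critical_path_ms["native all-reduce"], native_step_ms)

if __name__ == "__main__":
    print("step saving %:", step_saving_pct)
    print("listed kernels ms:", listed_ms, "share %:", listed_share_pct)
    print("all-reduce share %:", allreduce_share_pct)
```

The step saving is measured here as a reduction in time relative to the AITER step, which reproduces the 2.7% figure. The all-reduce alone accounts for more than half of the native step, which is consistent with the focus on collectives and on deferring reductions.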

Optimizations

compact EP routes from 128 global experts to the 16 experts owned by each rank
tighten route padding and use the route-aware fused MoE reduction
defer FFN reductions into the following Gemma RMSNorm boundary
use a gfx942 FP32 router projection kernel for the exact MiniMax M3 decode shape
tune the MI300X E16 block-FP8 expert configuration
replicate the TP input embedding to remove its startup all-reduce
use the AITER fused all-reduce + Gemma RMSNorm only on the measured TP8 shape
retain native collectives for EP, where the AITER fusion regresses both attention and FFN boundaries

All optimized paths are gated to the profiled MiniMax M3/gfx942 shapes. Other models, platforms, parallel modes, and unsupported shapes retain the existing path.
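A minimal sketch of the route compaction idea in the first item, in plain Python rather than the Triton and vLLM code the patch touches. It assumes contiguous expert ownership (rank r owns experts 16r through 16r + 15), which is an assumption of this example and is not confirmed by the description; `compact_routes` is a hypothetical name.

```python
GLOBAL_EXPERTS = 128
EP_SIZE = 8
LOCAL_EXPERTS = GLOBAL_EXPERTS // EP_SIZE  # 16 experts per rank


def compact_routes(topk_ids, rank):
    """Keep only the routes whose expert is owned by `rank`.

    topk_ids: per-token lists of global expert ids.
    Returns (token, slot, local_expert_id) tuples. Remote routes are dropped,
    so later padding and GEMM work scales with 16 experts instead of 128.
    Assumes contiguous ownership (illustrative, not taken from the patch).
    """
    lo = rank * LOCAL_EXPERTS
    hi = lo + LOCAL_EXPERTS
    kept = []
    for token, experts in enumerate(topk_ids):
        for slot, expert in enumerate(experts):
            if lo <= expert < hi:
                kept.append((token, slot, expert - lo))
    return kept


if __name__ == "__main__":
    topk_ids = [[0, 17, 127], [15, 16, 64]]
    for rank in (0, 1, 7):
        print(rank, compact_routes(topk_ids, rank))
```

Each rank sees only its own slice of the routing table, and the local expert ids index directly into the 16 expert weight sets that rank holds.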

Performance

MI300X output throughput, aggregate across 8 GPUs:

Configuration	Baseline	Final	Change
8k1k EP8 c256, main run 27510667862	1,066.3 tok/s	1,695.8 tok/s	+59.0%
8k1k EP8 c256, regressed run 27569397626	883.9 tok/s	1,695.8 tok/s	+91.9%

The baseline values come from the listed GitHub Actions runs, which use the regular InferenceX sweep request count. The Final value is a warmed production-image spot check with 256 fixed-length requests; the same-harness component run improved from 1,391.9 to 1,512.8 tok/s (+8.7%) before the final production-image validation.
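The Change column and the same-harness figure follow from a plain relative-change calculation, sketched below on the numbers quoted above. The helper name `change_pct` is illustrative.

```python
def change_pct(baseline: float, final: float) -> float:
    """Relative throughput change in percent, rounded to one decimal."""
    return round(100.0 * (final - baseline) / baseline, 1)


# (baseline tok/s, final tok/s) pairs taken from this section.
comparisons = {
    "8k1k EP8 c256 vs main run": (1066.3, 1695.8),
    "8k1k EP8 c256 vs regressed run": (883.9, 1695.8),
    "same-harness component run": (1391.9, 1512.8),
}

if __name__ == "__main__":
    for name, (baseline, final) in comparisons.items():
        print(f"{name}: {change_pct(baseline, final):+.1f}%")
```

Because the baseline and final values come from different harnesses, the same-harness +8.7% is the more conservative of the three comparisons.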

Production-image spot checks with the exact committed patch:

Configuration	Output throughput
8k1k EP8 c256	1,695.8 tok/s
8k1k TP8 c16	761.5 tok/s
32k1k TP8 c16	402.3 tok/s

Component A/B results:

TP8 AITER fusion: +5.1% at 8k1k/c16 and +1.7% at 32k1k/c16
32K scheduler budget: +4.0% at 8k1k TP c16, +4.5% at 8k1k EP c256, and +5.8% at 32k1k TP c16
the scheduler override is intentionally disabled for 1k prompts, where it regressed TP c16 by 3.1%

Validation

production squash vllm/vllm-openai-rocm:minimax-m3
runtime patches apply sequentially over image revision 4a560dd8db67c270f5e2afb614558271b76f2294
all 19 generated runtime files match the validated vLLM tree byte-for-byte
patched production image served TP8 and EP8 successfully with CUDA graphs
pinned AITER source built and initialized its custom communicator on all 8 TP ranks
git diff --check
bash -n and ShellCheck
python -m pytest utils/matrix_logic/ -v: 156 passed
MI300X MiniMax 1k1k/8k1k config generation: 36 entries

Note

Medium Risk
Large inference-runtime patch changes MoE routing, collectives, and model forward semantics on a gated path; wrong gating could affect numerics or parallelism, but scope is limited to profiled MiniMax M3 MI300X configurations.

Overview
Adds a second runtime patch (minimaxm3_mi300x_profiled.patch) on top of the existing MXFP8 block-FP8 patch, and refactors the MI300X benchmark script to apply both patches generically, optionally install a pinned AITER build for TP8-only fused all-reduce + Gemma RMSNorm, and pass --max-num-batched-tokens 32768 when ISL >= 8192 and CONC >= 16.

The patch targets profiled MiniMax M3 / gfx942 shapes: EP8 MoE route compaction and tuned block-FP8 expert configs; a gfx942 small-batch router GEMM; Triton tweaks to sparse attention and index scoring; deferred FFN all-reduces fused into the next Gemma norm boundary on TP8; and replicated input embeddings on MI300X TP8 to drop an extra collective. AITER Gemma fusion stays off for EP and non-TP8; native collectives remain there.

Gating is explicit (parallel mode, token counts, hidden size 6144, etc.) so other models and platforms keep prior behavior, with fallbacks to unfused all-reduce + norm where fast paths do not apply.
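The gating described here can be pictured as a single predicate with a fallback. The sketch below is hypothetical: `RunShape`, `select_allreduce_norm_path`, and the `max_tokens` threshold are illustrative names and values, not taken from the patch. Only the gated properties themselves (gfx942, hidden size 6144, TP8, EP excluded, a token-count limit) come from this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunShape:
    """Hypothetical description of one forward-pass configuration."""
    gpu_arch: str
    hidden_size: int
    tp_size: int
    ep_enabled: bool
    num_tokens: int


def select_allreduce_norm_path(shape: RunShape, max_tokens: int = 256) -> str:
    """Return "fused" only for the profiled TP8 shape, else "unfused".

    `max_tokens` is a placeholder threshold; the real limit is not stated
    in the PR description.
    """
    is_profiled_shape = (
        shape.gpu_arch == "gfx942"
        and shape.hidden_size == 6144
        and shape.tp_size == 8
        and not shape.ep_enabled  # EP stays on the native collective
        and 0 < shape.num_tokens <= max_tokens
    )
    return "fused" if is_profiled_shape else "unfused"


if __name__ == "__main__":
    tp8 = RunShape("gfx942", 6144, 8, False, 16)
    ep8 = RunShape("gfx942", 6144, 8, True, 256)
    print(select_allreduce_norm_path(tp8), select_allreduce_norm_path(ep8))
```

Keeping every condition in one conjunction makes the fallback the default: any shape that fails a single check stays on the existing unfused all-reduce + norm path.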

Reviewed by Cursor Bugbot for commit 87e92e2.

github-actions · 2026-06-15T19:13:41Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-16T17:53:52Z

Claude finished @Oseltamivir's task in 7m 0s

Review of PR #1782

LGTM — no blocking issues found.

Summary: This is a well-structured incremental optimization for MiniMax M3 EP8 on MI300X. The patch adds two complementary paths (short-context native/BF16 dispatch via the oracle, long-context sparse BF16 emulation with fused SwiGLU), and the gating logic correctly ensures each path is only reached for the profiled configuration. Key observations:

Correctness of the fused gated kernel: fused_moe_gated_kernel correctly computes both gate and up projections in a single pass, preserves the BF16 store/reload boundary before applying SwiGLU in FP32 (matching the unfused numerical behavior), and writes per-route rows to the activation buffer. The b_up_ptrs = b_gate_ptrs + N * stride_bn offset correctly addresses the split layout.
Local-expert allocation tightening: The num_local_experts parameter in moe_align_block_size correctly reduces the allocation from 128 * (block_size-1) to 16 * (block_size-1) padding tokens. The _max_post_padded bound is sound — min(allocation, M*block_m, M + E*(block_m-1)) takes the tightest of three independent bounds.
EP route filtering in emulation path: The _apply_sparse_mi300x_ep method correctly passes ignore_invalid_experts=True and num_local_experts=E to _prepare_expert_assignment, so remote-expert routes are dropped from sorted_token_ids. The downstream moe_fused_mul_sum with apply_weights=False and expert_map correctly zeros contributions from non-local experts (via tl.where(expert_mask, b_val, 0.0)), so unwritten rows in the workspace don't leak garbage.
Fused top-k atomic path in native MXFP8: FUSE_TOPK=True uses tl.atomic_add on a zero-initialized output with c_row = offs_token // top_k, correctly accumulating weighted per-route results into per-token output rows. The M_routed % top_k divisibility check guards against misuse.
Oracle routing: Short-context EP8 (max_model_len <= 4096) routes to Mxfp8NativeTritonExperts with mixed native/BF16 weights; long-context EP8 routes to Mxfp8EmulationTritonExperts with the sparse BF16 path. Both guards (_should_use_native_ep, _is_minimax_m3_mi300x_ep8) are scoped to the exact MiniMax M3 config.
Benchmark script: Patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. Server launch uses proper line continuations.
Repo guidelines: No master config or perf-changelog changes needed (this PR only touches the benchmark script and runtime patch). No MTP or model prefix concerns.
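The fused top-k accumulation described above maps each route-indexed row to its token row with `c_row = offs_token // top_k`. The sketch below emulates that mapping sequentially in plain Python, assuming routes are laid out as `token * top_k + slot`; the function name and the list-based data layout are illustrative, and the real kernel uses `tl.atomic_add` because routes of the same token may be processed concurrently.

```python
def accumulate_routes(route_rows, route_weights, top_k):
    """Sum weighted per-route rows into per-token rows.

    route_rows[i] is the expert output for route i, where route i belongs to
    token i // top_k (assumed layout). The output starts at zero, as the
    reviewed kernel requires for its atomic accumulation.
    """
    m_routed = len(route_rows)
    if m_routed % top_k != 0:
        raise ValueError("number of routes must be divisible by top_k")
    num_tokens = m_routed // top_k
    width = len(route_rows[0]) if route_rows else 0
    out = [[0.0] * width for _ in range(num_tokens)]
    for route, row in enumerate(route_rows):
        c_row = route // top_k  # route index -> token index
        for col, value in enumerate(row):
            out[c_row][col] += route_weights[route] * value
    return out


if __name__ == "__main__":
    rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    weights = [0.5, 0.5, 1.0, 0.0]
    print(accumulate_routes(rows, weights, top_k=2))
```

The divisibility check plays the same role as the `M_routed % top_k` guard mentioned in the review: without it, the integer division would silently assign trailing routes to a token row that does not exist.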

claude · 2026-06-17T01:09:54Z

Claude finished @Oseltamivir's task in 4m 57s

Review of PR #1782

Read repository guidelines (CLAUDE.md)
Gather context and read changed files
Review benchmark shell script changes
Review runtime patch file
Validate against PR review checklist
Post review summary

LGTM — no blocking issues found.

Summary: This is a well-scoped incremental EP8 decode optimization stacked on #1753. The patch adds two complementary paths: short-context EP8 routes to Mxfp8NativeTritonExperts with the profiled native/BF16 dispatch; long-context EP8 routes to Mxfp8EmulationTritonExperts which now compacts decode batches (≤256 tokens) to locally-owned experts with BM16 tiles, bypassing the overpadded generic path. Key observations:

Local-route compaction in _apply_sparse_mi300x_ep: The _prepare_expert_assignment call with ignore_invalid_experts=True and num_local_experts=E correctly drops remote-expert routes from sorted_token_ids. Buffer management is sound — workspace2 is reused for intermediate_gate_up (GEMM1 output) and intermediate_output (GEMM2 output) with the activation fully consuming the former before the latter is written. GEMM2 passes top_k=1 (treating each route as an independent token) with mul_routed_weight=True, then moe_fused_mul_sum with apply_weights=False sums per-token without double-applying weights, zeroing non-local experts via expert_map.
_max_post_padded bound: The three-way min(allocation, M_routed * block_m, M_routed + E * (block_m - 1)) is sound — the first is the buffer size, the second the original bound, the third the tighter per-expert padding bound. The block-alignment floor is correct.
moe_align_block_size tightening: When ignore_invalid_experts, expert_map, and num_local_experts are all set, padding allocation drops from global_experts * (block_size - 1) to local_experts * (block_size - 1). The 0 < num_local_experts <= num_experts validation prevents misuse.
Fused top-k atomic in _mxfp8_grouped_gemm_*_kernel: c_row = offs_token // top_k correctly maps route-indexed rows to token-indexed output, tl.atomic_add with zero-initialized output accumulates concurrent routes, and the M_routed % top_k divisibility check is validated before launch.
Route-aware SwiGLU kernel (_swiglu_oai_quant_routed_kernel): Processes only locally-routed rows via sorted_token_ids, with proper padding/remote masking. Gate is clamped from above only (gate * sigmoid → 0 for negative gate, so lower clamp is a no-op), up is symmetrically clamped — matching the SwiGLU-OAI numeric contract.
Oracle routing: Short-context EP8 (≤4096 max_model_len) → Mxfp8NativeTritonExperts; long-context EP8 → Mxfp8EmulationTritonExperts. Both guards are scoped to the exact profiled MiniMax M3 gfx94x shape. The bf16_weights_available flag prevents using uninitialized BF16 weights in long-context EP8 where they aren't retained.
Decode gating: The use_sparse_ep predicate in Mxfp8EmulationTritonExperts.apply correctly gates on model match, BF16 dtype, ≤256 tokens, SwiGLU activation, expert_map presence, no router-weight-on-input, and no LoRA. Prefill and mixed batches fall through to the generic TritonExperts path.
Benchmark script: EP patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. No master config or perf-changelog changes are included (as documented in scope).
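The three-way padding bound and its block-alignment floor can be sketched as follows. This is a plain-Python illustration under stated assumptions: `max_post_padded` and `padded_total` are hypothetical stand-ins for the patch's `_max_post_padded` logic, and the allocation value passed in the example (routes plus 128 * (block - 1)) is only one plausible choice.

```python
def padded_total(counts, block_m):
    """Rows actually needed when each expert's routes are padded up to a
    multiple of block_m (experts with zero routes need no rows)."""
    return sum(-(-c // block_m) * block_m for c in counts)


def max_post_padded(allocation, m_routed, num_local_experts, block_m):
    """Upper bound on padded rows: tightest of three bounds, floored to a
    block multiple.

    Flooring is safe because the true padded total is itself a multiple of
    block_m, so it can never lie strictly between the floor and the bound.
    """
    bound = min(
        allocation,                                     # buffer size
        m_routed * block_m,                             # one block per route
        m_routed + num_local_experts * (block_m - 1),   # per-expert padding
    )
    return (bound // block_m) * block_m


if __name__ == "__main__":
    block_m, local_experts = 16, 16
    counts = [17] + [0] * 15          # all 17 local routes hit one expert
    m_routed = sum(counts)
    allocation = m_routed + 128 * (block_m - 1)
    print(padded_total(counts, block_m),
          max_post_padded(allocation, m_routed, local_experts, block_m))
```

With 16 local experts the per-expert term dominates for small decode batches, which is why compacting from 128 to 16 experts shrinks the padded workspace so sharply.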

Oseltamivir · 2026-06-19T07:46:22Z

Optimized MI300X-only sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27812712075

Matrix: c1, c16, and c256 for each of 1k1k and 8k1k (TP8 at c1/c16, EP8 at c256), using optimized commit 87e92e28. No eval or non-MI300X jobs.

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

Oseltamivir marked this pull request as ready for review June 16, 2026 17:53

Oseltamivir added the full-sweep-enabled label Jun 16, 2026

Oseltamivir marked this pull request as draft June 16, 2026 20:26

Oseltamivir removed the full-sweep-enabled label Jun 16, 2026

Oseltamivir changed the title from "perf(vllm): fuse MiniMax M3 BF16 EP experts on MI300X" to "perf(vllm): compact MiniMax M3 EP decode routes on MI300X" Jun 16, 2026

Oseltamivir marked this pull request as ready for review June 17, 2026 01:09

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from d1638a0 to 465ff47 June 17, 2026 20:51

Oseltamivir requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 17, 2026 20:51

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 6 times, most recently from 95e79da to 27510c4 June 17, 2026 21:47

Oseltamivir changed the title from "perf(vllm): compact MiniMax M3 EP decode routes on MI300X" to "perf(vllm): compact MiniMax M3 block-FP8 EP routes on MI300X" Jun 18, 2026

Oseltamivir added full-sweep-enabled and removed full-sweep-enabled labels Jun 18, 2026

Oseltamivir changed the title from "perf(vllm): compact MiniMax M3 block-FP8 EP routes on MI300X" to "perf(vllm): optimize MiniMax M3 inference on MI300X" Jun 19, 2026

perf(vllm): optimize MiniMax M3 MI300X inference

87e92e2

Oseltamivir force-pushed the codex/minimax-m3-mi300x-ep-mxfp8 branch from 2b449ab to 87e92e2 June 19, 2026 07:30

Oseltamivir marked this pull request as draft June 19, 2026 21:27

Oseltamivir wants to merge 1 commit into feat/m3-mi300x-mxfp8 from codex/minimax-m3-mi300x-ep-mxfp8
